Intro to text mining - Text processing - 3

One should look for what is and not what he thinks should be. (Albert Einstein)

Field trip

  • View https://databasic.io/en/wordcounter/
  • Select your favorite artist under the “use a sample” menu and hit “COUNT”
  • Does the data tell a story about the selected artist?
  • What additional type of analysis of this data might be interesting?

Module completion checklist

Objective Complete
Create a Document-Term Matrix
Explore the distribution of words in the corpus

What is a Document-Term Matrix (DTM)?


  • A document-term matrix is simply a matrix of counts of each unique word in each document:

    • Documents are arranged in rows
    • Unique terms are arranged in columns
  • The corpus vocabulary consists of all of the unique terms (i.e. column names of DTM) and their total counts across all documents (i.e. column sums)

  • A Term-Document Matrix is simply the transpose of the Document-Term Matrix, with terms in rows and documents in columns
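As a toy illustration (the two documents and their counts below are invented, not the course data), here is a small DTM, its transpose, and the vocabulary counts as column sums:

```python
import pandas as pd

# Toy two-document DTM (counts invented for illustration).
dtm = pd.DataFrame([[2, 1, 0],
                    [0, 1, 3]],
                   index = ["doc1", "doc2"],
                   columns = ["cat", "dog", "fish"])

# The Term-Document Matrix is simply the transpose.
tdm = dtm.T

# The corpus vocabulary counts are the column sums of the DTM.
print(dtm.sum(axis = 0).to_dict())  # {'cat': 2, 'dog': 2, 'fish': 3}
```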

Create DTM with CountVectorizer

  • Another very powerful library in Python is scikit-learn, which is used heavily for machine learning. You can find its complete documentation on the scikit-learn website.
  • To create a Document-Term Matrix, we will use CountVectorizer from the scikit-learn library’s feature_extraction module for working with text

Create DTM with CountVectorizer (cont’d)

  • It takes a list of character strings that represent the documents as the main argument, passed to its fit_transform() method:

    • .fit_transform(list_of_documents)
  • It returns a sparse matrix (which can be viewed as a 2D array) with documents in rows and terms in columns - the DTM


Create a DTM

# Initialize `CountVectorizer`.
vec = CountVectorizer()
# Transform the list of clean documents `df_clean_list` into a DTM.
X = vec.fit_transform(df_clean_list)
print(X.toarray()) #<- to show output as a matrix
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
  • To get a list of names of columns (i.e. the unique terms in our corpus), we can use a utility method .get_feature_names_out()
print(vec.get_feature_names_out()[:10])
['abduct' 'abl' 'abo' 'absente' 'abus' 'academ' 'accept' 'access'
 'accessori' 'accommod']

Create a DTM (cont’d)

  • Let’s convert the matrix into a dataframe, where rows are IDs of the documents and columns are unique words that appear in those documents
# Convert the matrix into a Pandas DataFrame for easier manipulation.
DTM = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out())
print(DTM.head())
   abduct  abl  abo  absente  abus  academ  accept  access  ...  year  yell  yet  york  young  yuan  zimbabw  zykera
0       0    0    0        0     0       0       0       0  ...     0     0    0     0      0     0        0       0
1       0    0    0        0     0       0       0       0  ...     0     0    0     0      0     0        0       0
2       0    0    0        0     0       0       0       0  ...     0     0    0     0      1     0        0       0
3       0    0    0        0     0       0       0       0  ...     0     0    0     0      0     0        0       0
4       0    0    0        0     0       0       0       0  ...     0     0    0     0      0     0        0       0

[5 rows x 1921 columns]

DTM to dictionary of total word counts

  • NLTK word frequency visualization functions work with dictionaries
  • Before we convert our DTM to a dictionary, let’s create a convenience function that sorts all words in descending order by counts and displays the first n entries
  • We use a lambda function inside it to define the sort key
# Create a convenience function that sorts and looks at the first n entries in the dictionary.
def HeadDict(dict_x, n):
    # Get (key, value) pairs from the dictionary and sort them
    # by value in descending (i.e. reverse) order.
    # Note: sorted() already returns a list of pairs.
    sorted_x = sorted(dict_x.items(), 
                      reverse = True, 
                      key = lambda kv: kv[1])

    # Return only the first `n` entries as a dictionary.
    return dict(sorted_x[:n])
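For instance, the same idea can be written as a compact one-liner; the function below (named head_dict to keep it distinct) is tried on a small hand-made dictionary with invented counts:

```python
def head_dict(dict_x, n):
    # Equivalent compact form: sort items by value, descending, keep the first n.
    return dict(sorted(dict_x.items(), key = lambda kv: -kv[1])[:n])

# Hand-made dictionary of invented word counts.
toy_counts = {"said": 38, "new": 36, "year": 27, "abl": 1}
print(head_dict(toy_counts, 2))  # {'said': 38, 'new': 36}
```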

DTM to dictionary of total word counts (cont’d)

# Sum frequencies of each word across all documents.
print(DTM.sum(axis = 0).head())
abduct     1
abl        1
abo        1
absente    1
abus       2
dtype: int64
# Save series as a dictionary.
corpus_freq_dist = DTM.sum(axis = 0).to_dict()
# Glance at the frequencies.
print(HeadDict(corpus_freq_dist, 6))
{'said': 38, 'new': 36, 'presid': 28, 'year': 27, 'friday': 22, 'govern': 22}
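The same ranking can also be obtained with the standard library's collections.Counter, whose most_common() method does the sort-and-truncate step for us. A sketch on a toy dictionary (counts invented):

```python
from collections import Counter

# Toy frequency dictionary (counts invented for illustration).
freqs = {"said": 38, "new": 36, "presid": 28, "abl": 1}

# most_common(k) returns the k highest-count (word, count) pairs.
print(Counter(freqs).most_common(3))
# [('said', 38), ('new', 36), ('presid', 28)]
```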

“Bag-of-words” analysis: key elements

  • We have one more step remaining to learn; we will cover it next!

What we need:

  • A corpus of documents cleaned and processed in a certain way (learned):

    • All words are converted to lowercase
    • All punctuation, numbers and special characters are removed
    • Stopwords are removed
    • Words are stemmed to their root form

  • A Document-Term Matrix (DTM): with counts of each word recorded for each document (learned)

  • A transformed representation of the Document-Term Matrix (i.e. weighted with TF-IDF weights) - still to learn
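The TF-IDF weighting step is covered next; as a preview only, scikit-learn's TfidfTransformer can reweight an existing count matrix. This is a sketch on invented counts, not the course corpus:

```python
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

# Toy DTM counts (invented); rows = documents, columns = terms.
counts = np.array([[3, 0, 1],
                   [0, 2, 0]])

# Reweight raw counts with TF-IDF (rows are L2-normalized by default).
tfidf = TfidfTransformer().fit_transform(counts)
print(np.round(tfidf.toarray(), 2))
```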

Module completion checklist

Objective Complete
Create a Document-Term Matrix

Explore the distribution of words in the corpus

Plot distribution of words in document corpus

# Save as a FreqDist object native to nltk.
corpus_freq_dist = nltk.FreqDist(corpus_freq_dist)
# Plot distribution for the entire corpus.
plt.figure(figsize = (50, 10))
corpus_freq_dist.plot(80) #<- plot the 80 most frequent words

Visualizing words using n-grams

  • An n-gram is a contiguous sequence of n words that occur together in a sentence

  • This concept is used extensively in the field of Natural Language Processing to build n-gram based models which have various use cases like:

    • Predicting the next words in a sentence based on their probability
    • Correcting spelling errors in a sentence
  • We’re not going to dig deeper into these models in this course, but let’s learn how to create n-grams using the nltk package!
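Before reaching for a library, note that n-grams can be built with plain zip over shifted copies of a token list. A minimal sketch (the function name my_ngrams is ours, not from nltk):

```python
def my_ngrams(tokens, n):
    # Pair each token with the n-1 tokens that follow it by zipping
    # the list against shifted copies of itself.
    return list(zip(*(tokens[i:] for i in range(n))))

print(my_ngrams(["the", "cat", "sat", "down"], 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
```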

Visualizing words using n-grams (cont’d)

  • An n-gram is called:

    • uni-gram when the value of n is set to 1,
    • bi-gram when n = 2,
    • tri-gram when n = 3, and so on
  • To create an n-gram, we will use ngrams from the nltk library’s util module. You can find the complete documentation for ngrams on the NLTK website


Visualizing words using n-grams (cont’d)

  • Let’s create basic bi-grams and tri-grams using the ngrams function
from nltk.util import ngrams

print(df_clean_list[0])
nick kyrgio start brisban open titl defens battl victori american ryan harrison open round tuesday
word = df_clean_list[0].split()
print(list(ngrams(word, 2)))  #<- set value of n as 2
[('nick', 'kyrgio'), ('kyrgio', 'start'), ('start', 'brisban'), ('brisban', 'open'), ('open', 'titl'), ('titl', 'defens'), ('defens', 'battl'), ('battl', 'victori'), ('victori', 'american'), ('american', 'ryan'), ('ryan', 'harrison'), ('harrison', 'open'), ('open', 'round'), ('round', 'tuesday')]
print(list(ngrams(word, 3)))  #<- set value of n as 3
[('nick', 'kyrgio', 'start'), ('kyrgio', 'start', 'brisban'), ('start', 'brisban', 'open'), ('brisban', 'open', 'titl'), ('open', 'titl', 'defens'), ('titl', 'defens', 'battl'), ('defens', 'battl', 'victori'), ('battl', 'victori', 'american'), ('victori', 'american', 'ryan'), ('american', 'ryan', 'harrison'), ('ryan', 'harrison', 'open'), ('harrison', 'open', 'round'), ('open', 'round', 'tuesday')]

Convenience function to generate n-grams

  • Let’s create a convenience function to generate bi-grams and tri-grams for a subset of documents in our corpus!
def generate_ngrams(df_clean_list):
    for i in range(len(df_clean_list)):
        for n in range(2, 4):  #<- n = 2 (bi-grams) and n = 3 (tri-grams)
            n_grams = ngrams(df_clean_list[i].split(), n)
            for grams in n_grams:
                print(grams)
generate_ngrams(df_clean_list[0:10])
('nick', 'kyrgio')
('kyrgio', 'start')
('start', 'brisban')
('brisban', 'open')
('open', 'titl')
('titl', 'defens')
('defens', 'battl')
('battl', 'victori')
('victori', 'american')
('american', 'ryan')
('ryan', 'harrison')
('harrison', 'open')
('open', 'round')
('round', 'tuesday')
('nick', 'kyrgio', 'start')
('kyrgio', 'start', 'brisban')
('start', 'brisban', 'open')
('brisban', 'open', 'titl')
('open', 'titl', 'defens')
('titl', 'defens', 'battl')
('defens', 'battl', 'victori')
('battl', 'victori', 'american')
('victori', 'american', 'ryan')
('american', 'ryan', 'harrison')
('ryan', 'harrison', 'open')
('harrison', 'open', 'round')
('open', 'round', 'tuesday')
('british', 'polic')
('polic', 'confirm')
('confirm', 'tuesday')
('tuesday', 'treat')
('treat', 'stab')
('stab', 'attack')
('attack', 'injur')
('injur', 'three')
('three', 'peopl')
('peopl', 'manchest')
('manchest', 'victoria')
('victoria', 'train')
('train', 'station')
('station', 'terrorist')
('terrorist', 'investig')
('investig', 'search')
('search', 'address')
('address', 'cheetham')
('cheetham', 'hill')
('hill', 'area')
('area', 'citi')
('british', 'polic', 'confirm')
('polic', 'confirm', 'tuesday')
('confirm', 'tuesday', 'treat')
('tuesday', 'treat', 'stab')
('treat', 'stab', 'attack')
('stab', 'attack', 'injur')
('attack', 'injur', 'three')
('injur', 'three', 'peopl')
('three', 'peopl', 'manchest')
('peopl', 'manchest', 'victoria')
('manchest', 'victoria', 'train')
('victoria', 'train', 'station')
('train', 'station', 'terrorist')
('station', 'terrorist', 'investig')
('terrorist', 'investig', 'search')
('investig', 'search', 'address')
('search', 'address', 'cheetham')
('address', 'cheetham', 'hill')
('cheetham', 'hill', 'area')
('hill', 'area', 'citi')
('marcellu', 'wiley')
('wiley', 'still')
('still', 'fenc')
('fenc', 'let')
('let', 'young')
('young', 'son')
('son', 'play')
('play', 'footbal')
('footbal', 'former')
('former', 'nfl')
('nfl', 'defens')
('defens', 'end')
('end', 'fox')
('fox', 'sport')
('sport', 'person')
('person', 'tell')
('tell', 'podcaston')
('podcaston', 'sport')
('sport', 'like')
('like', 'nfl')
('nfl', 'tri')
('tri', 'make')
('make', 'footbal')
('footbal', 'safer')
('safer', 'game')
('game', 'de')
('marcellu', 'wiley', 'still')
('wiley', 'still', 'fenc')
('still', 'fenc', 'let')
('fenc', 'let', 'young')
('let', 'young', 'son')
('young', 'son', 'play')
('son', 'play', 'footbal')
('play', 'footbal', 'former')
('footbal', 'former', 'nfl')
('former', 'nfl', 'defens')
('nfl', 'defens', 'end')
('defens', 'end', 'fox')
('end', 'fox', 'sport')
('fox', 'sport', 'person')
('sport', 'person', 'tell')
('person', 'tell', 'podcaston')
('tell', 'podcaston', 'sport')
('podcaston', 'sport', 'like')
('sport', 'like', 'nfl')
('like', 'nfl', 'tri')
('nfl', 'tri', 'make')
('tri', 'make', 'footbal')
('make', 'footbal', 'safer')
('footbal', 'safer', 'game')
('safer', 'game', 'de')
('still', 'reckon')
('reckon', 'fallout')
('fallout', 'emmett')
('emmett', 'till')
('till', 'paint')
('paint', 'chasten')
('chasten', 'artist')
('artist', 'reveal')
('reveal', 'controversi')
('controversi', 'chang')
('chang', 'even')
('even', 'move')
('move', 'forward')
('forward', 'new')
('new', 'galleri')
('galleri', 'show')
('still', 'reckon', 'fallout')
('reckon', 'fallout', 'emmett')
('fallout', 'emmett', 'till')
('emmett', 'till', 'paint')
('till', 'paint', 'chasten')
('paint', 'chasten', 'artist')
('chasten', 'artist', 'reveal')
('artist', 'reveal', 'controversi')
('reveal', 'controversi', 'chang')
('controversi', 'chang', 'even')
('chang', 'even', 'move')
('even', 'move', 'forward')
('move', 'forward', 'new')
('forward', 'new', 'galleri')
('new', 'galleri', 'show')
('far', 'arik')
('arik', 'ogunbowal')
('ogunbowal', 'coach')
('coach', 'muffet')
('muffet', 'mcgraw')
('mcgraw', 'concern')
('concern', 'notr')
('notr', 'dame')
('dame', 'victori')
('victori', 'louisvil')
('louisvil', 'thursday')
('thursday', 'night')
('night', 'anoth')
('anoth', 'atlant')
('atlant', 'coast')
('coast', 'confer')
('confer', 'game')
('game', 'januari')
('far', 'arik', 'ogunbowal')
('arik', 'ogunbowal', 'coach')
('ogunbowal', 'coach', 'muffet')
('coach', 'muffet', 'mcgraw')
('muffet', 'mcgraw', 'concern')
('mcgraw', 'concern', 'notr')
('concern', 'notr', 'dame')
('notr', 'dame', 'victori')
('dame', 'victori', 'louisvil')
('victori', 'louisvil', 'thursday')
('louisvil', 'thursday', 'night')
('thursday', 'night', 'anoth')
('night', 'anoth', 'atlant')
('anoth', 'atlant', 'coast')
('atlant', 'coast', 'confer')
('coast', 'confer', 'game')
('confer', 'game', 'januari')
('prohibit', 'vacat')
('vacat', 'rental')
('rental', 'arrang')
('arrang', 'onlin')
('onlin', 'airbnb')
('airbnb', 'move')
('move', 'closer')
('closer', 'realiti')
('realiti', 'thursday')
('thursday', 'new')
('new', 'orlean')
('prohibit', 'vacat', 'rental')
('vacat', 'rental', 'arrang')
('rental', 'arrang', 'onlin')
('arrang', 'onlin', 'airbnb')
('onlin', 'airbnb', 'move')
('airbnb', 'move', 'closer')
('move', 'closer', 'realiti')
('closer', 'realiti', 'thursday')
('realiti', 'thursday', 'new')
('thursday', 'new', 'orlean')
('contamin', 'food')
('food', 'smell')
('smell', 'like')
('like', 'freedom')
('contamin', 'food', 'smell')
('food', 'smell', 'like')
('smell', 'like', 'freedom')
('end', 'sight')
('sight', 'partial')
('partial', 'feder')
('feder', 'shutdown')
('shutdown', 'distress')
('distress', 'feder')
('feder', 'worker')
('worker', 'paycheck')
('paycheck', 'sight')
('sight', 'either')
('end', 'sight', 'partial')
('sight', 'partial', 'feder')
('partial', 'feder', 'shutdown')
('feder', 'shutdown', 'distress')
('shutdown', 'distress', 'feder')
('distress', 'feder', 'worker')
('feder', 'worker', 'paycheck')
('worker', 'paycheck', 'sight')
('paycheck', 'sight', 'either')
('bottleneck', 'offload')
('offload', 'import')
('import', 'fuel')
('fuel', 'form')
('form', 'mexican')
('mexican', 'oil')
('oil', 'port')
('port', 'follow')
('follow', 'govern')
('govern', 'order')
('order', 'shut')
('shut', 'pipelin')
('pipelin', 'limit')
('limit', 'loss')
('loss', 'widespread')
('widespread', 'fuel')
('fuel', 'theft')
('theft', 'accord')
('accord', 'trader')
('trader', 'refinitiv')
('refinitiv', 'eikon')
('eikon', 'data')
('bottleneck', 'offload', 'import')
('offload', 'import', 'fuel')
('import', 'fuel', 'form')
('fuel', 'form', 'mexican')
('form', 'mexican', 'oil')
('mexican', 'oil', 'port')
('oil', 'port', 'follow')
('port', 'follow', 'govern')
('follow', 'govern', 'order')
('govern', 'order', 'shut')
('order', 'shut', 'pipelin')
('shut', 'pipelin', 'limit')
('pipelin', 'limit', 'loss')
('limit', 'loss', 'widespread')
('loss', 'widespread', 'fuel')
('widespread', 'fuel', 'theft')
('fuel', 'theft', 'accord')
('theft', 'accord', 'trader')
('accord', 'trader', 'refinitiv')
('trader', 'refinitiv', 'eikon')
('refinitiv', 'eikon', 'data')
('follow', 'reaction')
('reaction', 'andi')
('andi', 'murray')
('murray', 'announc')
('announc', 'friday')
('friday', 'year')
('year', 'australian')
('australian', 'open')
('open', 'could')
('could', 'last')
('last', 'tournament')
('tournament', 'profession')
('follow', 'reaction', 'andi')
('reaction', 'andi', 'murray')
('andi', 'murray', 'announc')
('murray', 'announc', 'friday')
('announc', 'friday', 'year')
('friday', 'year', 'australian')
('year', 'australian', 'open')
('australian', 'open', 'could')
('open', 'could', 'last')
('could', 'last', 'tournament')
('last', 'tournament', 'profession')
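Printing every n-gram quickly becomes unwieldy; a common next step is tallying them instead. The sketch below uses the standard library's collections.Counter with zip-built n-grams; the two documents are invented stand-ins for our corpus:

```python
from collections import Counter

def ngram_counts(docs, n):
    # Tally every n-gram across all documents.
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Build n-grams by zipping shifted copies of the token list.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

docs = ["the cat sat", "the cat ran"]  # invented stand-in documents
print(ngram_counts(docs, 2).most_common(1))  # [(('the', 'cat'), 2)]
```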

Visualizing word counts with word clouds

  • A word cloud is an effective way of visualizing word counts
# Import the WordCloud class from the `wordcloud` library.
from wordcloud import WordCloud

# Construct a word cloud from the corpus.
wordcloud = WordCloud(max_font_size = 40, background_color = "white", collocations = False)
wordcloud = wordcloud.generate(' '.join(df_clean_list))

# Plot the cloud using matplotlib.
plt.figure(figsize = (27, 20))
plt.imshow(wordcloud, interpolation = "bilinear")
plt.axis("off")
plt.show()
  • What words are most common in the text data that we just cleaned?
  • Based on that, what do you think these documents focus on?

Knowledge check


Exercise


You are now ready to try Tasks 7-9 in the Exercise for this topic

Module completion checklist

Objective Complete
Create a Document-Term Matrix

Explore the distribution of words in the corpus

Text Processing: Topic summary

In this part of the course, we have covered:

  • The need for text processing and the tools used to perform it
  • Definition and implementation of text processing steps
  • Word distribution in a corpus

Congratulations on completing this module!
